Large-Scale Neighbor-Joining with NINJA

Author

  • Travis J. Wheeler
Abstract

Neighbor-joining is a well-established hierarchical clustering algorithm for inferring phylogenies. It begins with observed distances between pairs of sequences, and the clustering order depends on a metric related to those distances. The canonical algorithm requires O(n^3) time and O(n^2) space for n sequences, which precludes application to very large sequence families, e.g. those containing 100,000 sequences. Datasets of this size are available today, and such phylogenies will play an increasingly important role in comparative genomics studies. Recent algorithmic advances have greatly sped up neighbor-joining for inputs of thousands of sequences, but are limited to fewer than 13,000 sequences on a system with 4 GB of RAM. In this paper, I describe an algorithm that speeds up neighbor-joining by dramatically reducing the number of distance values that are viewed in each iteration of the clustering procedure, while still computing a correct neighbor-joining tree. This algorithm can scale to inputs larger than 100,000 sequences because of external-memory-efficient data structures. A free implementation may be obtained from http://nimbletwist.com/software/ninja.
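
For orientation, the canonical procedure that the abstract refers to can be sketched as follows. This is a minimal illustration of textbook neighbor-joining, not the NINJA algorithm itself: the function name `neighbor_joining`, the `nodeK` labels, and the five-taxon distance matrix are assumptions introduced for the example. Each iteration scans every pair of active nodes, which is the per-iteration cost that the paper aims to reduce.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook neighbor-joining on a symmetric distance matrix.

    names: leaf labels; dist: square list of lists, dist[i][j] >= 0.
    Returns a list of joins (child_a, child_b, len_a, len_b, parent).
    The final entry connects the last two nodes and has parent None.
    """
    # Working copy of the distances, keyed by node label.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    active = list(names)
    joins = []
    next_id = 0  # counter for hypothetical internal-node labels

    while len(active) > 2:
        r = len(active)
        # Row sums over the currently active nodes.
        totals = {a: sum(d[a][b] for b in active) for a in active}

        # Selection criterion: minimise Q(a, b) over all active pairs.
        def q(pair):
            a, b = pair
            return (r - 2) * d[a][b] - totals[a] - totals[b]

        a, b = min(combinations(active, 2), key=q)

        # Branch lengths from the joined pair to their new parent.
        len_a = 0.5 * d[a][b] + (totals[a] - totals[b]) / (2 * (r - 2))
        len_b = d[a][b] - len_a

        parent = f"node{next_id}"
        next_id += 1

        # Distances from the new parent to every remaining node.
        d[parent] = {parent: 0.0}
        for c in active:
            if c not in (a, b):
                d_pc = 0.5 * (d[a][c] + d[b][c] - d[a][b])
                d[parent][c] = d_pc
                d[c][parent] = d_pc

        active = [c for c in active if c not in (a, b)] + [parent]
        joins.append((a, b, len_a, len_b, parent))

    # Two nodes remain; they are connected by a single edge.
    a, b = active
    joins.append((a, b, d[a][b], 0.0, None))
    return joins


# Illustrative five-taxon input (made-up distances).
example_names = ["a", "b", "c", "d", "e"]
example_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
example_joins = neighbor_joining(example_names, example_dist)
```

With n leaves the loop runs n - 2 times and each pass examines all pairs of active nodes, which gives the cubic running time quoted above; the full distance table accounts for the quadratic space.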

Related resources

Efficient Large-Scale Phylogeny Reconstruction

In this study we introduce two novel distance-based algorithms with provably high computational and statistical efficiency. Furthermore, we report the results of experiments simulating sequence evolution on large trees with 135, 500, and 1895 leaves showing high success rates of our algorithms for large mutation probabilities, and high success rates of the popular Neighbor-Joining algorithm for...

Why Neighbor-Joining Works

We show that the neighbor-joining algorithm is a robust quartet method for constructing trees from distances. This leads to a new performance guarantee that contains Atteson’s optimal radius bound as a special case and explains many cases where neighbor-joining is successful even when Atteson’s criterion is not satisfied. We also provide a proof for Atteson’s conjecture on the optimal edge radi...

FastJoin, an improved neighbor-joining algorithm.

Reconstructing the evolutionary history of a set of species is an elementary problem in biology, and methods for solving this problem are evaluated based on two characteristics: accuracy and efficiency. Neighbor-joining reconstructs phylogenetic trees by iteratively picking a pair of nodes to merge as a new node until only one node remains; due to its good accuracy and speed, it has been e...

Efficient Construction of Accurate Multiple Alignments and Large-Scale Phylogenies

A central focus of computational biology is to organize and make use of vast stores of molecular sequence data. Two of the most studied and fundamental problems in the field are sequence alignment and phylogeny inference. The problem of multiple sequence alignment is to take a set of DNA, RNA, or protein sequences and identify related segments of these sequences. Perhaps the most common use of ...

Neighbor Joining and Maximum Likelihood with RNA Sequences: Addressing the Interdependence of Sites

Intrastrand base pairings give ribosomal and other RNA molecules characteristic structures that are important for their function. In order to maintain these structures, a substitution at one paired site may have to be compensated for by an appropriate substitution at the complementary site. Thus paired sites do not evolve independently of one another. Most current methods for inferring phylogen...

Publication date: 2009